{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 用 LangChain 自制 chatPDF"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们需要做的关键步骤包括：加载文档，分割文档，创建嵌入，将嵌入放入向量存储。接下来，我们将使用 LangChain 创建我们的链，以便我们可以进行查询，查找向量存储中的信息，将其带入链中，然后与问题结合，最终得出一个答案。"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 环境和工具"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "[notice] A new release of pip is available: 23.3 -> 23.3.2\n",
      "[notice] To update, run: C:\\Users\\freestone\\AppData\\Local\\Microsoft\\WindowsApps\\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\\python.exe -m pip install --upgrade pip\n"
     ]
    }
   ],
   "source": [
    "!pip -q install langchain openai tiktoken PyPDF2 faiss-cpu"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "设置密钥"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"填入您的OpenAI开发密钥\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 下载 PDF 文本 \n",
    "复制到浏览器后，直接保存在本地电脑上。https://www.impromptubook.com/wp-content/uploads/2023/03/impromptu-rh.pdf"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 导入库和 PDF 加载器\n",
    "\n",
    "首先，我们需要一个 PDF 阅读器。虽然我们这次只使用了一个基础的 PDF 阅读器，你也可以根据需要选择更合适的 PDF 阅读器。我们通过 PDF 阅读器将 PDF 文档读取成一个长字符串。这个过程可能会遇到一些格式问题，比如奇怪的空格等，但每个项目都有自己独特的数据处理方法。不论是使用基础的处理方式，还是使用诸如'unstructured'库、AWS 或 Google Cloud 的 API，都有可能。\n",
    "\n",
    "\n",
    "先导入阅读器 `PdfReader`, 嵌入模型 `OpenAIEmbeddings`, 文档切分器 `CharacterTextSplitter`, 向量存储库 `FAISS`  ： "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "from PyPDF2 import PdfReader\n",
    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "from langchain.vectorstores import FAISS "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "加载之前准备好的 PDF 素材："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 括号内填入当前电脑保存pdf文件的位置. \n",
    "doc_reader = PdfReader('./impromptu-rh.pdf')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "打印 `doc_reader` 看看 :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<PyPDF2._page._VirtualList at 0x1cc85ed5b50>"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc_reader.pages"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "将 PDF 文档转化为可用的 `raw_text` 格式："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "# read data from the file and put them into a variable called raw_text\n",
    "raw_text = ''\n",
    "for i, page in enumerate(doc_reader.pages):\n",
    "    text = page.extract_text()\n",
    "    if text:\n",
    "        raw_text += text"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们可以打印 `raw_text` 的结果字符串长度，看看是不是转换成功了："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "371090"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(raw_text) # 得到 371090 的结果"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  文本拆分\n",
    "\n",
    "我们需要将获取到的长字符串拆分为适合分析的小段落。\n",
    "\n",
    "我们的方法很简单，就是将这个长字符串按照字符数拆分。比如我们可以设定每 1000 个字符为一个块, `chunk_size = 1000`。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Splitting up the text into smaller chunks for indexing\n",
    "text_splitter = CharacterTextSplitter(        \n",
    "    separator = \"\\n\",\n",
    "    chunk_size = 1000,\n",
    "    chunk_overlap  = 200, #striding over the text\n",
    "    length_function = len,\n",
    ")\n",
    "texts = text_splitter.split_text(raw_text)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们总共切了 466 个块："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "466"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(texts) # 448"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*注意*： 在这个代码片段中，`chunk_overlap` 参数用于指定文本切分时的重叠量（overlap）。它表示在切分后生成的每个分块之间重叠的字符数。具体来说，这个参数表示每个分块的前后，两个分块之间会有多少个字符是重复的。举例来说 chunkA 和 chunkB, 他们有 200 个字符是重复的。"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后，我们采用滑动窗口的方法来拆分文本。即每个块之间会有部分字符重叠，比如在每 1000 个字符的块上，我们让前后两块有 200 个字符重叠。这样做的目的是避免关键信息被切分，而且即使有些信息出现在了多个块中，因为我们是在获取整体语义，所以这些重叠的块在语义上也会有所区别。"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们可以随机打印一块的内容："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Because, really, an AI book? When things are moving so \\nquickly? Even with a helpful AI on hand to speed the process, \\nany such book would be obsolete before we started to write it—\\nthat’s how fast the industry is moving.\\nSo I hemmed and hawed for a bit. And then I thought of a frame \\nthat pushed me into action.\\nThis didn’t have to be a comprehensive “book” book so much as \\na travelog, an informal exercise in exploration and discovery, \\nme (with GPT-4) choosing one path among many. A snapshot \\nmemorializing—in a subjective and decidedly not definitive \\nway—the AI future we were about to experience.\\nWhat would we see? What would impress us most? What would \\nwe learn about ourselves in the process? Well aware of the brief \\nhalf-life of this travelog’s relevance, I decided to press ahead.\\nA month later, at the end of November 2022, OpenAI released \\nChatGPT, a “conversational agent,” aka chatbot, a modified \\nversion of GPT-3.5 that they had fine-tuned through a process'"
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "texts[20]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  创建嵌入和检索"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "有了分好的小块文本，我们就可以为这些文本创建嵌入了。在这个步骤，我们使用了 OpenAI 的嵌入技术。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download embeddings from OpenAI\n",
    "embeddings = OpenAIEmbeddings()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们先把文本传给嵌入制造器，然后通过 FAISS 库创建向量存储本身。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [],
   "source": [
    "docsearch = FAISS.from_texts(texts, embeddings)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "至此，我们已经将原本的 PDF 文档转化为了可以进行机器学习的向量数据。\n",
    "\n",
    "就是这么简单，接下来我们就可以向这个 PDF 问问题了。"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 相似度检索"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在，我们可以使用这些向量数据来进行搜索匹配了。我们以一个实际的查询为例：“GPT-4 如何改变社交媒体？”。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"GPT-4 如何改变了社交媒体?\"\n",
    "docs = docsearch.similarity_search(query) # 这是搜索匹配的文字结果数组。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='cian, GPT-4 and ChatGPT are not only able but also incredi-\\nbly willing to focus on whatever you want to talk about.4 This \\nsimple dynamic creates a highly personalized user experience. \\nAs an exchange with GPT-4 progresses, you are continuously \\nfine-tuning it to your specific preferences in that moment. \\nWhile this high degree of personalization informs whatever \\nyou’re using GPT-4 for, I believe it has special salience for the \\nnews media industry.\\nImagine a future where you go to a news website and use \\nqueries like these to define your experience there:\\n4  Provided it doesn’t violate the safety restrictions OpenAI has put on \\nthem.93Journalism\\n● Hey, Wall Street Journal, give me hundred-word summa-\\nries of your three most-read tech stories today.\\n● Hey, CNN, show me any climate change stories that hap-\\npened today involving policy-making.\\n● Hey, New York Times, can you create a counter-argument \\nto today’s Paul Krugman op-ed, using only news articles', metadata={}),\n",
       " Document(page_content='it has significantly improved on GPT-3’s capabilities, 115Social Media\\nCodysseus could leverage the large and diverse corpus \\nof his own written and spoken content to train a custom \\nmodel that captures his unique voice and style. He could \\nalso use additional data sources, such as feedback from \\nhis audience, social media interactions, and relevant top -\\nics and trends, to further refine and update his model. \\nBy fine-tuning GPT-4 on his own data, Codysseus could \\npotentially generate new content that is consistent, \\ncoherent, and creative, and that reflects his personality \\nand preferences. He could also use his model to interact \\nwith his fans, answer questions, and provide commen-\\ntary on various issues. However, Codysseus would also \\nneed to be careful and ethical in using his model, and \\nensure that he discloses and acknowledges the use of \\nGPT-4 to his viewers and followers. He would also need \\nto monitor and evaluate his model’s outputs, and correct', metadata={}),\n",
       " Document(page_content='GPT-4:  There are different versions and applications of \\nGPT that can produce different types of output, such as \\nimage generation, data analysis, or code writing. Simi-\\nlarly, there are other GPT models that can specialize in \\ngenerating 3D models, lighting effects, or audio editing, \\nbased on text input or other modalities. For example, \\nDALL-E is a GPT-3 variant that can create images from \\ntext descriptions, while Jukebox is a neural network that \\ncan generate music and lyrics. These models are not \\nperfect, but they demonstrate the potential of GPTs to \\nhelp improve the creative practices of various types of \\nartists and creators.61Creativity\\nAnd then, this minor wow  moment: GPT-4, of its own “voli-\\ntion” (if we can speak of such a thing), added something I didn’t \\nask for but that it correctly inferred I would want to know.\\nGPT-4:  To make the examples more realistic and spe -\\ncific, we could add some details about how the artists', metadata={}),\n",
       " Document(page_content='behavior and preferences?\\nGPT-4:  News media websites will use data such as the \\nreader’s location, the type of device they are using, their \\nbrowsing history, and the articles they have previously \\nread or interacted with. They can also analyze which \\narticles the reader has shared on social media, how long \\nthey spend on the website, and which sections they visit \\nmost frequently. Additionally, data on the reader’s age, \\ngender, and interests can be used to personalize con-\\ntent. This data can be obtained through cookies, surveys, \\nand user accounts, among other methods. By analyzing 92Impromptu: Amplifying Our Humanity Through AI\\nthis data, news media websites can tailor their content \\nto the reader’s preferences and interests, providing a \\nmore personalized experience.\\nIn other words, the traditional approach. Not too innovative.\\nMeanwhile, as I noted above, the “conversational chat” \\napproach that GPT-4 and especially its sibling ChatGPT were', metadata={})]"
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们把这个查询传给文档搜索系统，使用相似度匹配搜索 `similarity_search`，在向量数据中寻找与查询最匹配的文档。\n",
    "\n",
    "我们的搜索结果将包括与查询最接近的四个文档，而这些文档都是通过我们的嵌入函数进行嵌入的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 85,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(docs)  # 结果为 4 代表有 4 处地方跟问题有关系"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们尝试打印第一个 `docs[0]`\n",
    "在我们的搜索结果中，首位的文档中多次提到了“社交媒体”，看来我们的查询效果还是很好的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Document(page_content='cian, GPT-4 and ChatGPT are not only able but also incredi-\\nbly willing to focus on whatever you want to talk about.4 This \\nsimple dynamic creates a highly personalized user experience. \\nAs an exchange with GPT-4 progresses, you are continuously \\nfine-tuning it to your specific preferences in that moment. \\nWhile this high degree of personalization informs whatever \\nyou’re using GPT-4 for, I believe it has special salience for the \\nnews media industry.\\nImagine a future where you go to a news website and use \\nqueries like these to define your experience there:\\n4  Provided it doesn’t violate the safety restrictions OpenAI has put on \\nthem.93Journalism\\n● Hey, Wall Street Journal, give me hundred-word summa-\\nries of your three most-read tech stories today.\\n● Hey, CNN, show me any climate change stories that hap-\\npened today involving policy-making.\\n● Hey, New York Times, can you create a counter-argument \\nto today’s Paul Krugman op-ed, using only news articles', metadata={})"
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs[0]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这就是如何利用 OpenAI 技术处理 PDF 文档，将海量的信息提炼为可用的数据的全部步骤。是不是很简单，赶紧动手做起来吧~\n",
    "\n",
    "我们现在只有一个 PDF 文档，实现代码也很简单，Langchain 给了很多组件，我们完成得很快。接下来，我们处理多文档的提问，现实是我们要获取到真实的信息，通过会跨越多个文档，才能提取有用的信息。比如读取金融研报，新闻综合报道等等。"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###   进阶 Stuff 链"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在上一节中，我们加载了一个 PDF 文档，转化格式，切分字符后，创建向量数据来进行搜索匹配获得了问题的答案。一旦我们有了已经处理好的文档，我们就可以开始构建一个简单的问答链。现在我们看看 如何使用 Langchain 构建问答链。 \n",
    "\n",
    "在这个过程中，我们使用了 OpenAI 的模型，并选择了 Langchain 的现有文档处理链中 一种被称为 \"stuff\" 的链类型。在这种模式下，我们只是将所有内容都放在一个调用中，理想情况下，我们放入的内容应该少于 4000 个令牌。\n",
    "\n",
    "除了 \"stuff\" 之外，Langchain 文档处理链还有 精化（Refine）、Map reduce 、重排（Map re-rank）。后面我们会再次用到。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chains.question_answering import load_qa_chain\n",
    "from langchain.llms import OpenAI\n",
    "chain = load_qa_chain(OpenAI(), \n",
    "                      chain_type=\"stuff\") # we are going to stuff all the docs in at once"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 运行链，构建查询\n",
    "\n",
    "下一步，我们要构建我们的查询。首先，我们使用向量存储中返回的内容作为上下文片段来回答我们的问题。然后，我们将这个查询传给语言模型链。语言模型链会回答这个查询，给出相应的答案。\n",
    "\n",
    "例如，我们可能会问 \"这本书的作者是谁？\"，然后将该查询传递给向量存储进行相似性搜索。系统会返回最相似的四个文档，我们将这些文档传递给语言模型链并给出查询，然后系统会给出一个答案。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "' 不知道'"
      ]
     },
     "execution_count": 101,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query = \"这本书是哪些人创作的？请用中文回答\"\n",
    "docs = docsearch.similarity_search(query)\n",
    "chain.run(input_documents=docs, question=query)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 选择返回的文档数量"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们可以设置返回的文档数量。默认情况下，系统会返回四个最相关的文档，但我们可以更改这个数字。\n",
    "\n",
    "例如，我们可以设置返回前六个或更多的搜索结果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "' 这本书的作者是Reid Hoffman和Sam Altman。'"
      ]
     },
     "execution_count": 102,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query = \"这本书的作者是谁\"\n",
    "docs = docsearch.similarity_search(query,k=6)\n",
    "chain.run(input_documents=docs, question=query)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 进阶 map_rerank 链"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了解决这个问题，我们可以更改链类型。我们在之前的文章中看过许多不同的链类型。\n",
    "\n",
    "\"stuff\" 类型优势是把所有内容都放在一起的地方。任何时候我们可以使用 \"stuff\"，最好就使用它，通用且节省成本。\n",
    "\n",
    "我们还可以使用 \"map_reduce\" 在并行计算中对每个文档进行操作，但这可能会导致对 API 进行过多的调用，增加成本。\n",
    "\n",
    "继续我们的讨论，我们将深入了解如何通过 Langchain 技术从 PDF 文档中提取有用的信息，特别是我们将重点讨论如何处理多个查询和理解返回结果。\n",
    "\n",
    "第一种方式是使用不同类型的查询链类型。这里我们使用 `map_rerank` 这种类型，提高查询的质量。"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### `map_rerank` 优化查询质量\n",
    "\n",
    "让我们从提出更复杂的查询开始。比如说，我们想要知道 \"OpenAI 是什么\"，并且我们想要获取前 10 个最相关的查询结果。在这种情况下，OpenAI 会返回多个答案，而不仅仅是一个。我们可以看到它不只返回一个答案，而是根据我们的需求返回了每个查询的答案和相应的评分。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'intermediate_steps': [{'answer': ' This document does not answer the question',\n",
       "   'score': '0'},\n",
       "  {'answer': ' This document does not answer the question.', 'score': '0'},\n",
       "  {'answer': ' This document does not answer the question.', 'score': '0'},\n",
       "  {'answer': ' This document does not answer the question.', 'score': '0'},\n",
       "  {'answer': ' This document does not answer the question.', 'score': '0'},\n",
       "  {'answer': ' This document does not answer the question.', 'score': '0'},\n",
       "  {'answer': ' OpenAI 的创始人是 Elon Musk、Sam Altman、Greg Brockman、Ilya Sutskever 和卡尔·施密特。',\n",
       "   'score': '100'},\n",
       "  {'answer': ' This document does not answer the question', 'score': '0'},\n",
       "  {'answer': ' This document does not answer the question', 'score': '0'},\n",
       "  {'answer': ' This document does not answer the question', 'score': '0'}],\n",
       " 'output_text': ' OpenAI 的创始人是 Elon Musk、Sam Altman、Greg Brockman、Ilya Sutskever 和卡尔·施密特。'}"
      ]
     },
     "execution_count": 103,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.chains.question_answering import load_qa_chain\n",
    "from langchain.llms import OpenAI\n",
    "\n",
    "chain = load_qa_chain(OpenAI(), \n",
    "                      chain_type=\"map_rerank\",\n",
    "                      return_intermediate_steps=True\n",
    "                      ) \n",
    "\n",
    "query = \"OpenAI 的创始人是谁?\"\n",
    "docs = docsearch.similarity_search(query,k=10)\n",
    "results = chain({\"input_documents\": docs, \"question\": query}, return_only_outputs=True)\n",
    "results"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "重要的参数是 `return_intermediate_steps=True`, 设置这个参数我们可以看到 `map_rerank` 是如何对检索到的文档进行打分的。"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#####  理解评分系统\n",
    "\n",
    "OpenAI 技术对返回的每个查询结果进行了评分。比如说，OpenAI 在这本书中被多次提及，因此它的评分可能会有 80 分，90 分甚至 100 分。我们可以假设 OpenAI 可能选择了评分为 100 分的两个或三个查询，然后将它们合并，最终给出了我们的输出。"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "评分后，模型输出一个最终的答案, `'score': '100'` 得分 100 的那个答案："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "' OpenAI 的创始人是 Elon Musk 和 Sam Altman。'"
      ]
     },
     "execution_count": 92,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results['output_text'] "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了搞清楚为什么模型会评分，做出判断，我们可以打印 prompt 提示模板："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\\n\\nIn addition to giving an answer, also return a score of how fully it answered the user's question. This should be in the following format:\\n\\nQuestion: [question here]\\nHelpful Answer: [answer here]\\nScore: [score between 0 and 100]\\n\\nHow to determine the score:\\n- Higher is a better answer\\n- Better responds fully to the asked question, with sufficient level of detail\\n- If you do not know the answer based on the context, that should be a score of 0\\n- Don't be overconfident!\\n\\nExample #1\\n\\nContext:\\n---------\\nApples are red\\n---------\\nQuestion: what color are apples?\\nHelpful Answer: red\\nScore: 100\\n\\nExample #2\\n\\nContext:\\n---------\\nit was night and the witness forgot his glasses. he was not sure if it was a sports car or an suv\\n---------\\nQuestion: what type was the car?\\nHelpful Answer: a sports car or an suv\\nScore: 60\\n\\nExample #3\\n\\nContext:\\n---------\\nPears are either red or orange\\n---------\\nQuestion: what color are apples?\\nHelpful Answer: This document does not answer the question\\nScore: 0\\n\\nBegin!\\n\\nContext:\\n---------\\n{context}\\n---------\\nQuestion: {question}\\nHelpful Answer:\""
      ]
     },
     "execution_count": 93,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check the prompt\n",
    "chain.llm_chain.prompt.template"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### RetrievalQA 链"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "除了单个查询，我们还可以使用链式查询。我们可以开始将这些查询放在一个链条中。\n",
    "\n",
    "链式查询指的是查询向量存储链和语言模型的链。\n",
    "\n",
    "例如我们可以有一个查询向量存储的链，以及一个查询语言模型的链。然后，我们可以将这些链条组合在一起，创建一个检索 QA 链条 。"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 使用 RetrievalQA 链\n",
    "\n",
    "RetrievalQA 链是 Langchain 已经封装好的索引查询问答链。实例化之后，我们可以直接把问题扔给它，而不需要 `chain.run()`。简化了很多步骤，获得了比较稳定的查询结果。\n",
    "\n",
    "为了创建这样的链，我们需要一个检索器。我们可以使用之前设置好的 docsearch, 作为检索器，并且我们可以设置返回的文档数量 `\"k\":4`。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "docsearch = FAISS.from_texts(texts, embeddings) # 检索器是向量库数据"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们可以将这些参数传递给链条类型 \"stuff\"，它会为我们返回源文档。（选择 \"stuff\" 类型的原因：跟第一个 \"stuff\" 类型 和 `map_reduce` 类型对比答案的质量）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chains import RetrievalQA\n",
    "\n",
    "# set up FAISS as a generic retriever \n",
    "retriever = docsearch.as_retriever(search_type=\"similarity\", search_kwargs={\"k\":4})  # as_retriever方法是构建检索器 \n",
    "\n",
    "# create the chain to answer questions \n",
    "rqa = RetrievalQA.from_chain_type(llm=ChatOpenAI(model=\"gpt-3.5-turbo-0613\"), \n",
    "                                  chain_type=\"stuff\", \n",
    "                                  retriever=retriever, \n",
    "                                  return_source_documents=True)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 返回结果和源文档\n",
    "\n",
    "当我们查询 \"OpenAI 是什么\" 时，我们不仅会得到一个答案，还会得到源文档 `source_documents`。源文档是返回结果的参考文档，它可以帮助我们理解答案是如何得出的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'OpenAI是一个人工智能研究实验室和公司，致力于推动人工智能的发展和应用。它由一些科技行业的重要人物共同创立，包括LinkedIn的联合创始人Reid Hoffman。OpenAI的目标是创建人类友好的人工智能，并将其推广应用于各个领域，以促进社会的进步和发展。'"
      ]
     },
     "execution_count": 96,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query = \"OpenAI 是什么?\"\n",
    "rqa(query)['result']"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 直接返回结果设置\n",
    "\n",
    "如果我们不需要中间步骤和源文档，只需要最终答案，那么我们可以直接请求返回结果。将代码:return_source_documents=True  改为 return_source_documents=False"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "比如说，我们问 \"What does gpt-4 mean for creativity?\"\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'GPT-4对创新力有着潜在的影响。首先，GPT-4可以作为创意助手帮助人类创作者进行头脑风暴、编辑、反馈、翻译和市场营销等任务。它可以与人类艺术家进行成功的合作，如OpenAI的Jukebox、DALL-E和MuseNet等项目。\\n\\n其次，GPT-4可以生成图像、音乐、视频和其他形式的媒体，这对于艺术家和创作者来说是一个巨大的机会。它可以为创作者提供灵感、创意和技术支持，帮助他们在创作过程中提高效率和质量。\\n\\n然而，GPT-4也可能带来一些挑战。一方面，由于GPT-4是基于大量数据训练而来的，它可能会受到数据偏见的影响，导致创作中出现一些不平衡或偏颇的结果。另一方面，GPT-4生成的作品可能缺乏原创性和个性化，因为它是通过学习和模仿已有作品而生成的。\\n\\n此外，GPT-4的广泛应用可能导致一些问题，如知识产权和版权的问题。由于GPT-4可以生成大量的内容，可能会引发关于创作权的争议和纠纷。\\n\\n总之，GPT-4对创新力的影响具有积极和负面的潜力。它可以为创作者提供支持和启发，但也需要注意其潜在的局限性和挑战。'"
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query = \"gpt-4 对创新力有什么影响?\"\n",
    "rqa(query)['result']"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
